Exploring and Summarizing Prosper Loan Data by MAHDI

## [1] 113937     81

The dataset contains 81 variables and 113K+ entries. We’ll keep only 14 columns and 10K entries for a reasonable processing time. Chosen columns should help achieve our goal.

## [1] 113937     81
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"
## [1] 10000    81
## [1] 10000    16

Univariate Plots Section

## 'data.frame':    10000 obs. of  16 variables:
##  $ ProsperScore             : num  3 7 NA NA NA 6 NA NA NA NA ...
##  $ Term                     : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
##  $ BorrowerRate             : num  0.285 0.205 0.11 0.111 0.2 ...
##  $ LenderYield              : num  0.275 0.1949 0.0999 0.1011 0.19 ...
##  $ ListingCategory..numeric.: int  7 13 4 0 4 2 0 1 6 1 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
##  $ EmploymentStatusDuration : int  80 70 39 56 229 26 35 3 5 56 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
##  $ CurrentCreditLines       : int  14 12 3 4 5 13 1 8 5 10 ...
##  $ DebtToIncomeRatio        : num  0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 5 6 4 5 5 5 4 3 4 1 ...
##  $ StatedMonthlyIncome      : num  5250 7500 3120 5598 4167 ...
##  $ LoanOriginalAmount       : int  2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1843 1394 574 304 608 1846 310 612 532 541 ...
##  $ MonthlyLoanPayment       : num  83.3 168.4 311 491.9 92.9 ...

Prosper Score

# 11 is less risky
# transforming ProsperScore to factor
is.factor(pl_subsamp$ProsperScore)
## [1] FALSE
pl_subsamp$ProsperScore <- factor(pl_subsamp$ProsperScore, levels = c(1:11))

summary(pl_subsamp$ProsperScore)
##    1    2    3    4    5    6    7    8    9   10   11 NA's 
##   93  474  701 1111  857 1057  898 1066  608  437  116 2582
qplot(x = ProsperScore, data = subset(pl_subsamp, !is.na(ProsperScore)),
      xlab = "Prosper Score") +
  scale_x_discrete(breaks = seq(1, 11, 1))

Most loans are of middle risk profiles, and low and high risks are almost equally served. It will be interesting to check how the risk and the borrower rate relate to each other.

Term

pl_subsamp$Term <- factor(pl_subsamp$Term)

summary(pl_subsamp$Term)
##   12   36   60 
##  142 7716 2142
qplot(x = Term, data = pl_subsamp, binwidth=1,
      xlab = "Term in Months")

There are 3 loan terms: 1y, 3y, 5y. Term could be considered as a categorical variable. However, its ditribution shows that there are only 3 terms. So we chose to factorise this variable and consider it as ordinal.

Loan Status

# most of loans are current, very few are defaulted, some are charged off
summary(pl_subsamp$LoanStatus)
##              Cancelled             Chargedoff              Completed 
##                      1                   1056                   3321 
##                Current              Defaulted FinalPaymentInProgress 
##                   5026                    406                     15 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                      0                     62                     27 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                     27                     25                     34
qplot(x = LoanStatus, data = pl_subsamp, 
      fill = LoanStatus, xlab = "Loan Status") +
  scale_x_discrete(breaks = NULL)

count(pl_subsamp[pl_subsamp$LoanStatus == "Defaulted",]) / dim(pl_subsamp)[1]
##        n
## 1 0.0406

Only 4.06% of loans have resulted in a default since 2005.

Borrower Rate

summary(pl_subsamp$BorrowerRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1334  0.1815  0.1915  0.2498  0.4500
qplot(x = BorrowerRate*100, data = pl_subsamp,
      binwidth = 1) +
  scale_x_continuous(breaks = seq(0, 36, 2)) +
  xlab("Borrower Rate, %")

qplot(x = BorrowerRate*100, data = pl_subsamp) +
  scale_x_log10() +
  xlab("Borrower Rate, %")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(x = BorrowerRate*100, data = pl_subsamp) +
  scale_x_sqrt() +
  xlab("Borrower Rate, %")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(aes(x = BorrowerRate*100),
       data = pl_subsamp) +
  geom_density() +
  xlab("Borrower Rate, %")

3/4 quantile of is ~25%. There is a peak at 31%; what could be the reason?

Lender Yield

# a peak @ 31%
summary(pl_subsamp$LenderYield)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1234  0.1700  0.1815  0.2395  0.4325
qplot(x = LenderYield*100, data = pl_subsamp,
      binwidth = 1, xlab = "Lender Yield %") +
  scale_x_continuous(breaks = seq(0, 36, 2))

qplot(x = LenderYield, data = pl_subsamp) +
  scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(x = LenderYield, data = pl_subsamp) +
  scale_x_sqrt()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

It looks like BorrowerRate.

Listing Categories with meaningful text

# peak for debt consolidation
pl_subsamp$ListingCategory <- pl_subsamp$ListingCategory..numeric.
pl_subsamp$ListingCategory <- 
  plyr::mapvalues(pl_subsamp$ListingCategory,
            from = c(0:20),
            to = c("Not Available", "Debt Consolidation", "Home Improvement",
                   "Business",
                   "Personal Loan", "Student Use", "Auto", "Other",
                   "Baby&Adoption", "Boat", "Cosmetic Procedure",
                   "Engagement Ring", "Green Loans", "Household Expenses",
                   "Large Purchases", "Medical/Dental", "Motorcycle", "RV",
                   "Taxes", "Vacation", "Wedding Loans"))
table(pl_subsamp$ListingCategory..numeric.)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 1494 5122  645  663  200   65  217  917   23    3    6   20    8  180   78 
##   15   16   17   18   19   20 
##  122   21    2   78   70   66
table(pl_subsamp$ListingCategory)
## 
##               Auto      Baby&Adoption               Boat 
##                217                 23                  3 
##           Business Cosmetic Procedure Debt Consolidation 
##                663                  6               5122 
##    Engagement Ring        Green Loans   Home Improvement 
##                 20                  8                645 
## Household Expenses    Large Purchases     Medical/Dental 
##                180                 78                122 
##         Motorcycle      Not Available              Other 
##                 21               1494                917 
##      Personal Loan                 RV        Student Use 
##                200                  2                 65 
##              Taxes           Vacation      Wedding Loans 
##                 78                 70                 66
pl_subsamp.backup.ListingCategory..numeric. <- pl_subsamp$ListingCategory..numeric.
pl_subsamp <- subset(pl_subsamp, select = -ListingCategory..numeric.)
pl_subsamp$ListingCategory <- factor(pl_subsamp$ListingCategory)
dim(pl_subsamp)
## [1] 10000    16
qplot(data = pl_subsamp, x = ListingCategory,
      fill = ListingCategory, xlab = "Listing Category") +
  scale_x_discrete(breaks = NULL)

with(pl_subsamp, sum(ListingCategory == "Debt Consolidation") /
       length(ListingCategory))
## [1] 0.5122

~51% of loans are for debt consolidation.

Employment Status

# most of customers are employed
summary(pl_subsamp$EmploymentStatus)
##                    Employed     Full-time Not available  Not employed 
##           190          5887          2355           452            69 
##         Other     Part-time       Retired Self-employed 
##           364            98            73           512
qplot(x = EmploymentStatus, 
      data = subset(pl_subsamp, EmploymentStatus != ""),
      fill = EmploymentStatus, xlab = "Employment Status") +
  scale_x_discrete(breaks = NULL)

Employed borrowers are more likely to get a loan.

Employment Status Duration

# half of customers have been working for less than 6 years
# right skewed distribution with a mean of 9 years
summary(pl_subsamp$EmploymentStatusDuration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   25.00   67.00   94.95  135.00  634.00     644
qplot(x = EmploymentStatusDuration/12.,
      data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
      binwidth = 1,
      xlab = "Employment Status Duration, Years") +
  scale_x_continuous(breaks = seq(0, 800/12., 4))

qplot(x = EmploymentStatusDuration/12.+1/365.,
      data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
      binwidth = 1./12.,
      xlab = "Employment Status Duration, Years") +
  scale_x_log10()

qplot(x = EmploymentStatusDuration/12.+1/365.,
      data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
      binwidth = 1./12.,
      xlab = "Employment Status Duration, Years") +
  scale_x_sqrt()

ggplot(aes(x = EmploymentStatusDuration/12.),
       data = subset(pl_subsamp, !is.na(EmploymentStatusDuration))) +
  geom_density() +
  xlab("Employment Status Duration, Years")

3/4 quantile of employment duration is 11+ years, mean is ~8- years, median is 5.5 years and 1/4 quantile is ~2+ years. Distribution is log-normal with right skew and many borrowers are fresh workers.

Home Owner?

summary(pl_subsamp$IsBorrowerHomeowner)
## False  True 
##  4910  5090
qplot(data = pl_subsamp, x = IsBorrowerHomeowner)

50.9% of customers own their home. Could this variable have any impact?

Current Credit Lines

# normal distribution
summary(pl_subsamp$CurrentCreditLines)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.38   13.00   54.00     642
qplot(data = pl_subsamp, x = CurrentCreditLines,
      binwidth = 1, xlab = "Current Credit Lines")

qplot(data = pl_subsamp, x = CurrentCreditLines,
      binwidth = 0.01, xlab = "Current Credit Lines") +
  scale_x_log10()

Half of customers have already 10 credit lines opened at the time of the listing. The distribution is normal with a right skew.

Debt To Income Ratio

# skewed distribution to the right
# square root of DebtToIncomeRatio is normal
summary(pl_subsamp$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1400  0.2200  0.2741  0.3100 10.0100     717
sum(subset(pl_subsamp, !is.na(DebtToIncomeRatio))$DebtToIncomeRatio>10)
## [1] 25
qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
      x = DebtToIncomeRatio,
      binwidth = 0.02) +
  xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
  xlab("Debt To Income Ratio")

qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
      x = DebtToIncomeRatio,
      binwidth = 0.15) +
  scale_x_log10() +
  xlab("Debt To Income Ratio")

qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
      x = DebtToIncomeRatio,
      binwidth = 0.04) +
  scale_x_sqrt() +
  xlab("Debt To Income Ratio")

3/4 have a debt to income ratio less than 31%, with 25 exceeding 1000%.

Reordering IncomeRange factor

# middle income tend to borrow more
# low inclome have difficult access to loans
# higher income borrow less
table(pl_subsamp$IncomeRange)
## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##             55            637           1548           2837           2778 
## $75,000-99,999  Not displayed   Not employed 
##           1427            649             69
is.factor(pl_subsamp$IncomeRange)
## [1] TRUE
levels(pl_subsamp$IncomeRange)
## [1] "$0"             "$1-24,999"      "$100,000+"      "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed"  "Not employed"
pl_subsamp.backup.IncomeRange <- pl_subsamp$IncomeRange
pl_subsamp$IncomeRange <- 
  factor(pl_subsamp$IncomeRange,
         levels(pl_subsamp$IncomeRange)[c(7, 8, 1, 2, 4, 5, 6, 3)])
qplot(data = pl_subsamp, x = IncomeRange,
      fill = IncomeRange, xlab = "Income Range") +
  scale_x_discrete(breaks = NULL)

Most of loans are requested by middle-income customers, $25K-$75K.

Stated Monthly Income

summary(pl_subsamp$StatedMonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3250    4651    5609    6750  394400
qplot(x = StatedMonthlyIncome, data = pl_subsamp,
      geom = "density") +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99))
## Warning: Removed 100 rows containing non-finite values (stat_density).

qplot(x = StatedMonthlyIncome, data = pl_subsamp,
      binwidth = 500) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99))

qplot(x = StatedMonthlyIncome, data = pl_subsamp,
      binwidth = 0.1) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99)) +
  scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

A lognormal distribution.

Loan Original Amount

# some peaks around 4k, 11k, 16k, 21k, 26k
# distribution like bi-modal at 4k and 16k
summary(pl_subsamp$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8419   12000   35000
qplot(data = pl_subsamp, x = LoanOriginalAmount,
      binwidth = 1000, geom = "density",
      xlab = "Loan Original Amount")

qplot(data = pl_subsamp, x = LoanOriginalAmount,
      binwidth = 1000,
      xlab = "Loan Original Amount") +
  scale_x_continuous(breaks = seq(0, 35000, 5000))

qplot(data = pl_subsamp, x = LoanOriginalAmount,
      binwidth = 5000,
      xlab = "Loan Original Amount") +
  scale_x_continuous(breaks = seq(0, 35000, 5000))

qplot(data = pl_subsamp, x = LoanOriginalAmount,
      binwidth = 0.2,
      xlab = "Loan Original Amount") +
  scale_x_log10()

qplot(data = pl_subsamp, x = LoanOriginalAmount,
      binwidth = 20,
      xlab = "Loan Original Amount") +
  scale_x_sqrt()

There are some peaks for the original amount at 4K, 11K and 16K. Mean is ~$8400.

Loan Origination Date

str(pl_subsamp)
## 'data.frame':    10000 obs. of  16 variables:
##  $ ProsperScore            : Factor w/ 11 levels "1","2","3","4",..: 3 7 NA NA NA 6 NA NA NA NA ...
##  $ Term                    : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LoanStatus              : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
##  $ BorrowerRate            : num  0.285 0.205 0.11 0.111 0.2 ...
##  $ LenderYield             : num  0.275 0.1949 0.0999 0.1011 0.19 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
##  $ EmploymentStatusDuration: int  80 70 39 56 229 26 35 3 5 56 ...
##  $ IsBorrowerHomeowner     : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
##  $ CurrentCreditLines      : int  14 12 3 4 5 13 1 8 5 10 ...
##  $ DebtToIncomeRatio       : num  0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
##  $ IncomeRange             : Factor w/ 8 levels "Not displayed",..: 6 7 5 6 6 6 5 8 5 3 ...
##  $ StatedMonthlyIncome     : num  5250 7500 3120 5598 4167 ...
##  $ LoanOriginalAmount      : int  2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
##  $ LoanOriginationDate     : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1843 1394 574 304 608 1846 310 612 532 541 ...
##  $ MonthlyLoanPayment      : num  83.3 168.4 311 491.9 92.9 ...
##  $ ListingCategory         : Factor w/ 21 levels "Auto","Baby&Adoption",..: 15 10 16 14 16 9 14 6 1 6 ...
pl_subsamp$LoanOriginationDate <-
  as.Date(pl_subsamp$LoanOriginationDate, "%Y-%m-%d")
str(pl_subsamp)
## 'data.frame':    10000 obs. of  16 variables:
##  $ ProsperScore            : Factor w/ 11 levels "1","2","3","4",..: 3 7 NA NA NA 6 NA NA NA NA ...
##  $ Term                    : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LoanStatus              : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
##  $ BorrowerRate            : num  0.285 0.205 0.11 0.111 0.2 ...
##  $ LenderYield             : num  0.275 0.1949 0.0999 0.1011 0.19 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
##  $ EmploymentStatusDuration: int  80 70 39 56 229 26 35 3 5 56 ...
##  $ IsBorrowerHomeowner     : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
##  $ CurrentCreditLines      : int  14 12 3 4 5 13 1 8 5 10 ...
##  $ DebtToIncomeRatio       : num  0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
##  $ IncomeRange             : Factor w/ 8 levels "Not displayed",..: 6 7 5 6 6 6 5 8 5 3 ...
##  $ StatedMonthlyIncome     : num  5250 7500 3120 5598 4167 ...
##  $ LoanOriginalAmount      : int  2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
##  $ LoanOriginationDate     : Date, format: "2014-01-28" "2012-04-12" ...
##  $ MonthlyLoanPayment      : num  83.3 168.4 311 491.9 92.9 ...
##  $ ListingCategory         : Factor w/ 21 levels "Auto","Baby&Adoption",..: 15 10 16 14 16 9 14 6 1 6 ...
pl_subsamp.Dates <-
  subset(pl_subsamp, select = c(LoanOriginationDate, BorrowerRate)) %>% 
  separate(LoanOriginationDate, c("Year", "Month", "Day"), sep = "-") %>% 
  dplyr::select(-Day)

pl_subsamp.Dates$YearCut <-
  as.Date(cut(pl_subsamp$LoanOriginationDate, breaks = "year"))
pl_subsamp.Dates$MonthCut <-
  as.Date(cut(pl_subsamp$LoanOriginationDate, breaks = "month"))

qplot(x = Year, data = pl_subsamp.Dates)

pl_subsamp.Dates %>% 
  dplyr::select(c(Month, Year)) %>% 
  group_by(Year) %>% 
  summarise(MinMonth = min(Month), 
            MaxMonth = max(Month))
## Source: local data frame [10 x 3]
## 
##    Year MinMonth MaxMonth
## 1  2005       11       11
## 2  2006       01       12
## 3  2007       01       12
## 4  2008       01       10
## 5  2009       05       12
## 6  2010       01       12
## 7  2011       01       12
## 8  2012       01       12
## 9  2013       01       12
## 10 2014       01       03

A drop in loans in 2009 followed by increasing loans thereafter. This trend can be explained by the continuous decrease in interest rates since the crisis. Data for 2005 and 2014 does not cover all the year, that’s why we don’t have too many observations for these years. No loans were granted between 11/2008 and 04/2009.

Monthly Loan Payment

# lognormal distribution
summary(pl_subsamp$MonthlyLoanPayment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   134.8   223.0   275.7   372.7  1778.0
qplot(data = pl_subsamp, x = MonthlyLoanPayment,
      binwidth = 150) +
  xlab("Monthly Loan Payment")

qplot(data = pl_subsamp, x = MonthlyLoanPayment,
      binwidth = 0.2) +
  scale_x_log10() +
  xlab("Monthly Loan Payment")

qplot(data = pl_subsamp, x = MonthlyLoanPayment,
      binwidth = 4) +
  scale_x_sqrt() +
  xlab("Monthly Loan Payment")

The mean of monthly loans is $293.3.

Computing fees

# almost all fees are 1%
pl_subsamp$Fees <- with(pl_subsamp, BorrowerRate - LenderYield)
summary(pl_subsamp$Fees)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01000 0.01000 0.01009 0.01000 0.05000
with(pl_subsamp, cor.test(BorrowerRate, LenderYield))
## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LenderYield
## t = 2536.2, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9991927 0.9992536
## sample estimates:
##       cor 
## 0.9992237

BorrowerRate and LenderYield are heavily correlated. We’ll keep only one of them.

MonthlyDebtAmount = DebtToIncomeRatio * StatedMonthlyIncome

pl_subsamp$MonthlyDebtAmount <-
  with(pl_subsamp, DebtToIncomeRatio * StatedMonthlyIncome)

summary(pl_subsamp$MonthlyDebtAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     645    1067    1222    1625   11150     717
qplot(x = MonthlyDebtAmount, data = pl_subsamp, binwidth = 100) +
  scale_x_continuous(limits = c(0, 5000), 
                     breaks = seq(0, 5000, 500)) +
  xlab("Monthly Debt Amount")

qplot(x = MonthlyDebtAmount, data = pl_subsamp) +
  scale_x_log10() +
  xlab("Monthly Debt Amount")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Mean debt amount is ~$1200.

Univariate Analysis

What is the structure of your dataset?

There are 100K+ observations and 81 variables. We decide to keep only 16. Following are the retained variables: ProsperScore, Term, LoanStatus, BorrowerRate, LenderYield, ListingCategory..numeric., EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner, CurrentCreditLines, DebtToIncomeRatio, IncomeRange, StatedMonthlyIncome, LoanOriginalAmount, LoanOriginationDate, MonthlyLoanPayment. ProsperScore is an oridnal variable with values going from 1 (higher risk) to 11 (lower risk). LoanStatus, EmploymentStatus, IsBorrowerHomeowner, ListingCategory are categorical variables. IncomeRange is an interval variable.

The size of the sample will be cut to 10K in order to speed up processing.

What is/are the main feature(s) of interest in your dataset?

I would like to build a model that predicts how much a borrower would pay for a loan. The main feature is BorrowerRate.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will test all the features in the dataset, and filter them at each step of the analysis. However, I ’ve decided to remove some of them. Here are the reasons: - MonthlyLoanPayment: it can be implied from the loan duration, the rate and amount - ListingCategory: all categories are related to consumption, thus I don’t think it could have an impact on rates - Fees: almost all fees are 1% - LenderYield: linear relationship with BorrowerRate

# deleting some features
pl_subsamp <- subset(pl_subsamp,
                     select = -c(MonthlyLoanPayment, 
                                 ListingCategory, 
                                 LenderYield, Fees))
dim(pl_subsamp)
## [1] 10000    14

Did you create any new variables from existing variables in the dataset?

  • +- Fees = LenderYield - BorrowerRate Most fees (LenderYield-BorrowerRate) are equal to 1% (99.9% quantile). So I decided to remove them afterward along with LenderYield in order to avoid a multicollinearity problem.
    • MonthlyDebtAmount = StatedMonthlyIncome * DebtToIncomeRatio
    • ListingCategory: created by mapping ListingCategory..numeric. to meaningful descriptions

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

LoanOriginalAmount have an unusual distribution. Indeed, although many amounts are in the range $1K-$10K, we observe peaks at 4K, 11K, 16K, 21K and 26K. Groupings the amount in chunks of 5K displays a more bell-like histogram.

Bivariate Plots Section

ProsperScore vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, ProsperScore, summary))
## ProsperScore: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2999  0.3125  0.3042  0.3177  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1400  0.2555  0.2800  0.2732  0.3096  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1159  0.2139  0.2469  0.2471  0.2850  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0769  0.1772  0.2099  0.2236  0.2669  0.3600 
## -------------------------------------------------------- 
## ProsperScore: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0769  0.1685  0.2179  0.2281  0.3100  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0716  0.1550  0.1939  0.2055  0.2575  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0799  0.1359  0.1726  0.1839  0.2449  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0609  0.1139  0.1449  0.1526  0.1774  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0586  0.0930  0.1151  0.1242  0.1400  0.3500 
## -------------------------------------------------------- 
## ProsperScore: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04990 0.07160 0.08790 0.09765 0.11530 0.35000 
## -------------------------------------------------------- 
## ProsperScore: 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06050 0.07160 0.09900 0.09699 0.11240 0.18750
qplot(x = ProsperScore, y = BorrowerRate,
      data = subset(pl_subsamp, !is.na(ProsperScore)), 
      geom = "boxplot") +
  xlab("Prosper Score") +
  ylab("Rate")

qplot(x = ProsperScore, y = BorrowerRate,
      data = subset(pl_subsamp, ProsperScore %in% c(10, 11)), 
      geom = "boxplot") +
  xlab("Prosper Score by Term (M)") +
  ylab("Rate") +
  facet_wrap(~Term)

Inversely proportional. However, loans with scoring 11 have higher rates than those scoring 10, which is unintuitive. What could be the reason? The second plot helps us explain this phenomenon. Indeed, there are no loans with a score of 11 and a term of 12 months. 12 months loans happen to be those with the lowest rates.

Term vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, Term, summary))
## Term: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.0929  0.1499  0.1534  0.2176  0.2669 
## -------------------------------------------------------- 
## Term: 36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1274  0.1799  0.1920  0.2575  0.4500 
## -------------------------------------------------------- 
## Term: 60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0766  0.1485  0.1850  0.1925  0.2304  0.3304
qplot(x = Term, y = BorrowerRate,
      data = pl_subsamp, geom = "boxplot") +
  xlab("Term (M)") +
  ylab("Rate")

1 year loans have the lower rate. 3 and 5 years loans have almost the same mean, with more volatility on 3 years loans. 3 and 5 years loans means are not very different. Maybe there is more demand on 3 years loans thus putting more pressure on 3 years rates.

LoanStatus vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, LoanStatus, summary))
## LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.2     0.2     0.2     0.2     0.2     0.2 
## -------------------------------------------------------- 
## LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1700  0.2375  0.2331  0.2974  0.3600 
## -------------------------------------------------------- 
## LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1177  0.1700  0.1853  0.2500  0.3500 
## -------------------------------------------------------- 
## LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0604  0.1314  0.1765  0.1835  0.2301  0.3304 
## -------------------------------------------------------- 
## LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0600  0.1605  0.2192  0.2202  0.2750  0.4500 
## -------------------------------------------------------- 
## LoanStatus: FinalPaymentInProgress
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0890  0.1232  0.1474  0.1742  0.2124  0.2958 
## -------------------------------------------------------- 
## LoanStatus: Past Due (>120 days)
## NULL
## -------------------------------------------------------- 
## LoanStatus: Past Due (1-15 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0749  0.1665  0.2133  0.2133  0.2578  0.3177 
## -------------------------------------------------------- 
## LoanStatus: Past Due (16-30 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0799  0.1789  0.2489  0.2294  0.2742  0.3177 
## -------------------------------------------------------- 
## LoanStatus: Past Due (31-60 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0649  0.1760  0.2049  0.2152  0.2714  0.3304 
## -------------------------------------------------------- 
## LoanStatus: Past Due (61-90 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1299  0.2148  0.2489  0.2475  0.3149  0.3177 
## -------------------------------------------------------- 
## LoanStatus: Past Due (91-120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1269  0.1776  0.2298  0.2315  0.2766  0.3435
qplot(x = LoanStatus, y = BorrowerRate,
       data = pl_subsamp, geom = "boxplot", fill = LoanStatus) +
  scale_x_discrete(breaks = NULL) +
  xlab("Loan Status") +
  ylab("Rate")

Even if defaulted loans are among loans with the lowest rates, it does not necessarily mean that they are actually. One possibility, is that those loans were granted during a period of low rates. We’ll try to figure this out in the next section.

EmploymentStatus vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, EmploymentStatus, summary))
## EmploymentStatus: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0295  0.1270  0.1775  0.1833  0.2375  0.4500 
## -------------------------------------------------------- 
## EmploymentStatus: Employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1349  0.1819  0.1913  0.2489  0.3600 
## -------------------------------------------------------- 
## EmploymentStatus: Full-time
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1200  0.1700  0.1858  0.2457  0.3500 
## -------------------------------------------------------- 
## EmploymentStatus: Not available
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1400  0.1928  0.1926  0.2500  0.2900 
## -------------------------------------------------------- 
## EmploymentStatus: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0495  0.1949  0.2511  0.2385  0.3058  0.3500 
## -------------------------------------------------------- 
## EmploymentStatus: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0604  0.1568  0.2064  0.2122  0.2699  0.3304 
## -------------------------------------------------------- 
## EmploymentStatus: Part-time
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1299  0.1662  0.1862  0.2430  0.3500 
## -------------------------------------------------------- 
## EmploymentStatus: Retired
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0628  0.1147  0.1694  0.1821  0.2295  0.3500 
## -------------------------------------------------------- 
## EmploymentStatus: Self-employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1435  0.1950  0.2040  0.2699  0.3500
qplot(x = EmploymentStatus, y = BorrowerRate,
      data = subset(pl_subsamp, EmploymentStatus != ""),
      geom = "boxplot", fill = EmploymentStatus) +
  scale_x_discrete(breaks = NULL) +
  xlab("Employment Status") +
  ylab("Rate")

Full/Part-timers and retired pay lower rates. Non-employed pay higher rates.

EmploymentStatusDuration vs. BorrowerRate

ggplot(aes(x = EmploymentStatusDuration, y = BorrowerRate),
      data = pl_subsamp) +
  geom_point(alpha = 1/5) +
  xlab("Employment Duration") +
  ylab("Rate")
## Warning: Removed 644 rows containing missing values (geom_point).

with(pl_subsamp, cor.test(EmploymentStatusDuration, BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  EmploymentStatusDuration and BorrowerRate
## t = -1.3703, df = 9354, p-value = 0.1706
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.034420102  0.006098648
## sample estimates:
##         cor 
## -0.01416654

We don’t observe any pattern. Moreover the correlation is almost non existent.

BorrowerHomeowner vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, IsBorrowerHomeowner, summary))
## IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1435  0.1950  0.2014  0.2599  0.4500 
## -------------------------------------------------------- 
## IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1239  0.1699  0.1821  0.2346  0.3500
qplot(x = IsBorrowerHomeowner, y = BorrowerRate,
      data = pl_subsamp, geom = "boxplot") +
  xlab("Home Ownership") +
  ylab("Rate")

Home owners pay lower rates in average. Probably because they bring more guarantees to the bank.

CurrentCreditLines vs. BorrowerRate

pl_subsamp.EmploymentStatusDurationCut <-
  cut(pl_subsamp$EmploymentStatusDuration, 
      breaks = c(seq(0, 119, 12), seq(120, 599, 120), 634))
by(pl_subsamp$BorrowerRate, pl_subsamp.EmploymentStatusDurationCut, summary)
## pl_subsamp.EmploymentStatusDurationCut: (0,12]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0495  0.1310  0.1875  0.1953  0.2584  0.3600 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (12,24]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1277  0.1775  0.1891  0.2489  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (24,36]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1314  0.1899  0.1947  0.2566  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (36,48]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1314  0.1750  0.1918  0.2587  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (48,60]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0605  0.1435  0.1914  0.1999  0.2599  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (60,72]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0550  0.1325  0.1799  0.1892  0.2419  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (72,84]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1239  0.1768  0.1831  0.2344  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (84,96]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0520  0.1314  0.1832  0.1899  0.2414  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (96,108]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0565  0.1350  0.1800  0.1924  0.2499  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (108,120]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0590  0.1270  0.1734  0.1872  0.2478  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (120,240]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1350  0.1817  0.1911  0.2489  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (240,360]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0565  0.1355  0.1790  0.1902  0.2480  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (360,480]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0605  0.1242  0.1795  0.1865  0.2340  0.3500 
## -------------------------------------------------------- 
## pl_subsamp.EmploymentStatusDurationCut: (480,634]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0759  0.1612  0.1849  0.2043  0.2662  0.3304
ggplot(aes(x = CurrentCreditLines, y = BorrowerRate),
      data = subset(pl_subsamp, !is.na(CurrentCreditLines))) +
  geom_point(alpha = 1/5, position = "jitter") +
  xlab("Current Credit Lines") +
  ylab("Rate")

with(pl_subsamp, cor.test(CurrentCreditLines, BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  CurrentCreditLines and BorrowerRate
## t = -9.0742, df = 9356, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11344951 -0.07328034
## sample estimates:
##         cor 
## -0.09340294

There is no clear pattern.

DebtToIncomeRatio vs. BorrowerRate

ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate),
       data = subset(pl_subsamp, !is.na(DebtToIncomeRatio))) +
  geom_point(alpha = 1/5) +
  coord_cartesian(xlim = c(0, 1)) +
  xlab("Debt To Income Ratio") +
  ylab("Rate")

with(pl_subsamp, cor.test(DebtToIncomeRatio, BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  DebtToIncomeRatio and BorrowerRate
## t = 5.4205, df = 9281, p-value = 6.093e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03587453 0.07643211
## sample estimates:
##       cor 
## 0.0561765

The trend is not very marked, but there is a positive correlation between DebtToIncomeRatio and BorrowerRate.

IncomeRange vs. BorrowerRate

with(pl_subsamp, by(BorrowerRate, IncomeRange, summary))
## IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.1350  0.1900  0.1897  0.2450  0.4500 
## -------------------------------------------------------- 
## IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0495  0.1949  0.2511  0.2385  0.3058  0.3500 
## -------------------------------------------------------- 
## IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1400  0.1700  0.1909  0.2444  0.3500 
## -------------------------------------------------------- 
## IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1550  0.2130  0.2185  0.2862  0.3500 
## -------------------------------------------------------- 
## IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1449  0.1995  0.2045  0.2629  0.3600 
## -------------------------------------------------------- 
## IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1334  0.1800  0.1911  0.2492  0.3500 
## -------------------------------------------------------- 
## IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0499  0.1249  0.1685  0.1798  0.2290  0.3500 
## -------------------------------------------------------- 
## IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0565  0.1140  0.1520  0.1672  0.2085  0.3500
qplot(x = IncomeRange, y = BorrowerRate,
     data = pl_subsamp, geom = "boxplot",
     fill = IncomeRange) +
  scale_x_discrete(breaks = NULL) +
  xlab("Income Range") +
  ylab("Rate")

The higher the income, the lower the rate. Suprisingly, many customers reporting $0 income have low rates.

StatedMonthlyIncome vs. BorrowerRate

ggplot(aes(x = StatedMonthlyIncome, y = BorrowerRate),
       data = subset(pl_subsamp, !is.na(StatedMonthlyIncome))) +
  geom_point(alpha = 1/5, na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99)) +
  xlab("Stated Monthly Income") +
  ylab("Rate") +
  geom_smooth(method = "lm", color = "red", na.rm = TRUE)

with(pl_subsamp, cor.test(StatedMonthlyIncome, BorrowerRate))
## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and BorrowerRate
## t = -11.988, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.13831329 -0.09966838
## sample estimates:
##        cor 
## -0.1190359

The trend is not clear, but given the correlation, those reporting higher wages pay less rates.

LoanOriginalAmount vs. BorrowerRate

ggplot(aes(x = LoanOriginalAmount / 1000, y = BorrowerRate),
       data = pl_subsamp) +
  geom_point(alpha = 1/5) +
  xlab("Loan Original Amount, 1K") +
  ylab("Rate") +
  facet_wrap(~IncomeRange)

High-incomes have access to high amounts.

Year/Month vs. BorrowerRate

ggplot(aes(x = MonthCut, y = BorrowerRate),
       data = pl_subsamp.Dates) +
  stat_summary(fun.y = median, geom = "bar") +
  scale_x_date(labels = date_format("%m-%Y"), breaks = "18 months") +
  stat_smooth() +
  xlab("Rate") +
  ylab("Month")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

ggplot(aes(x = YearCut, y = BorrowerRate),
       data = pl_subsamp.Dates) +
  stat_summary(fun.y = median, geom = "bar") +
  scale_x_date(labels = date_format("%Y"), breaks = "1 year") +
  stat_smooth() +
  xlab("Year") +
  ylab("Rate")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Rates reached a peak in 2010/2011 and a bottom in 2008/2014.

MonthlyDebtAmount vs. BorrowerRate

ggplot(aes(x = MonthlyDebtAmount, y = BorrowerRate),
       data = subset(pl_subsamp, !is.na(MonthlyDebtAmount))) +
  geom_point(alpha = 1/5, na.rm = TRUE) +
  xlab("Monthly Debt Amount") +
  ylab("Rate") +
  xlim(0, quantile(pl_subsamp$MonthlyDebtAmount, 0.99, na.rm = TRUE))

There is no clear pattern in this visualisation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

BorrowerRate decreases with the risk profile. Lower the risk, lower is the rate. BorrowerRate increases with the loan’s term. BorrowerRate is low for full/part-time employees and retired, and high for non-employed. BorrowerRate is low for home owners. BorrowerRate is low for high-incomes, high for low-incomes. High-incomes have more access to credit. High-incomes can borrow higher amounts.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

From the scatterplot matrix, we observe a correlation between: - Negative: StatedMonthlyIncome and DebtToIncomeRatio - Positive: StatedMonthlyIncome and LoanOriginalAmount - Positive: StatedMonthlyIncome and CurrentCreditLines - Positive: CurrentCreditLines and MonthlyDebtAmount

ggplot(aes(x = StatedMonthlyIncome, y = DebtToIncomeRatio),
       data = pl_subsamp) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
  ylim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
  geom_point(alpha = 1/5, na.rm = TRUE) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  xlab("Monthly Income") + ylab("Debt Ratio")

with(pl_subsamp, cor.test(StatedMonthlyIncome, DebtToIncomeRatio))
## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and DebtToIncomeRatio
## t = -10.713, df = 9281, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.13057458 -0.09038547
## sample estimates:
##        cor 
## -0.1105252
ggplot(aes(x = StatedMonthlyIncome, y = LoanOriginalAmount),
       data = pl_subsamp) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
  ylim(0, quantile(pl_subsamp$LoanOriginalAmount, 0.99, na.rm = TRUE)) +
  geom_point(alpha = 1/5, na.rm = TRUE) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  xlab("Monthly Income") + ylab("Loan Amount")

with(pl_subsamp, cor.test(StatedMonthlyIncome, LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and LoanOriginalAmount
## t = 25.708, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2305365 0.2673069
## sample estimates:
##       cor 
## 0.2490114
ggplot(aes(x = StatedMonthlyIncome, y = CurrentCreditLines),
       data = pl_subsamp) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
  ylim(0, quantile(pl_subsamp$CurrentCreditLines, 0.99, na.rm = TRUE)) +
  geom_point(alpha = 1/5) +
  geom_smooth(method = "lm") +
  xlab("Monthly Income") + ylab("Credit Lines")
## Warning: Removed 809 rows containing missing values (stat_smooth).
## Warning: Removed 809 rows containing missing values (geom_point).

with(pl_subsamp, cor.test(StatedMonthlyIncome, CurrentCreditLines))
## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and CurrentCreditLines
## t = 17.091, df = 9356, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1542853 0.1935814
## sample estimates:
##       cor 
## 0.1740026
ggplot(aes(x = CurrentCreditLines, y = MonthlyDebtAmount),
       data = pl_subsamp) +
  xlim(0, quantile(pl_subsamp$CurrentCreditLines, 0.99, na.rm = TRUE)) +
  ylim(0, quantile(pl_subsamp$MonthlyDebtAmount, 0.99, na.rm = TRUE)) +
  geom_point(alpha = 1/5, na.rm = TRUE) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  xlab("Credit Lines") + ylab("Monthly Debt Amount")

with(pl_subsamp, cor.test(CurrentCreditLines, MonthlyDebtAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  CurrentCreditLines and MonthlyDebtAmount
## t = 62.237, df = 8648, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5414627 0.5705766
## sample estimates:
##       cor 
## 0.5561903

More you earn, more borrowing capacity you have. More you earn, more debt you accumulate.

What was the strongest relationship you found?

The strongest relationship I found is between CurrentCreditLines and MonthlyDebtAmount. As you accumulate more debt, your monthly payments increase.

Multivariate Plots Section

Term vs. BorrowerRate by Year

ggplot(aes(x = Term, y = BorrowerRate),
      data = pl_subsamp) +
  geom_boxplot() +
  facet_wrap(~Year) +
  xlab("Rate") + ylab("Term (M)")

Before 2010, there were only 3y loans. In 2010 appeared the first 5y loans. The year after appeared 1y loans. The 5y loans were attractive at the beginning as they offered lower rates than 3y loans. However they increased till surpassing 3y rates, which is more common.

LoanStatus vs. BorrowerRate by Year

qplot(x = LoanStatus, y = BorrowerRate,
       data = pl_subsamp, geom = "boxplot",
      fill = LoanStatus) +
  scale_x_discrete(breaks = NULL) +
  xlab("") + ylab("Term (M)") +
  facet_wrap(~Year)

Unlike what we observed in the bivariate plots, it’s clear here that defaulted loans were borrowed at higher rates.

DebtToIncomeRatio vs. BorrowerRate by Year

ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate, color = Year),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlab("Debt Ratio") + ylab("Rate") +
  xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
  scale_color_brewer(name = "Prosper Score", type = "div")

ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate, color = ProsperScore),
       data = subset(pl_subsamp, !is.na(ProsperScore))) +
  geom_point(na.rm = TRUE) +
  xlab("Debt Ratio") + ylab("Rate") +
  xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
  scale_color_brewer(name = "Prosper Score", type = "div") +
  facet_wrap(~Year)

Less risky customers are offered lower rates.

StatedMonthlyIncome vs. BorrowerRate by Year

# Coloring by ProsperScore
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
           color = ProsperScore),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99)) +
  xlab("Monthly Income, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Prosper Score", type = "div") +
  facet_wrap(~Year)

# Coloring by Term
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
           color = Term),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
  xlab("Monthly Income, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Term", type = "div") +
  facet_wrap(~Year)

# Coloring by DebtToIncomeRatio
summary(pl_subsamp$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1400  0.2200  0.2741  0.3100 10.0100     717
pl_subsamp.DebtToIncomeRatioCut <-
  cut(pl_subsamp$DebtToIncomeRatio, 
      breaks = c(seq(0, 1, 0.2), 11))
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
           color = pl_subsamp.DebtToIncomeRatioCut),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
  xlab("Monthly Income, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Debt Ratio", type = "div") +
  facet_wrap(~Year)

# Coloring by CurrentCreditLines
summary(pl_subsamp$CurrentCreditLines)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.38   13.00   54.00     642
pl_subsamp.CurrentCreditLinesCut <-
  cut(pl_subsamp$CurrentCreditLines, 
      breaks = c(seq(0, 20, 5), 54))
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
           color = pl_subsamp.CurrentCreditLinesCut),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
  xlab("Monthly Income, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Credit Lines", type = "div") +
  facet_wrap(~Year)

# Coloring by IsBorrowerHomeowner
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
           color = IsBorrowerHomeowner),
       data = pl_subsamp) +
  geom_point(na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
  xlab("Monthly Income, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Home Owner", type = "div") +
  facet_wrap(~Year)

ProsperScore is the only variable drawing a distinct frontier between different levels of rate. To explain the rate by other variables, we should imagine a combination of them. Non owners are mostly on the left of the plot; they have less income, but they could pay lower rates.

MonthlyDebtAmount vs. BorrowerRate by Year

# Coloring by IncomeRange
ggplot(aes(x = MonthlyDebtAmount / 1000, y = BorrowerRate,
           color = IncomeRange),
       data = pl_subsamp) +
  geom_point(alpha = 1/2, size = 2, position = "jitter", na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$MonthlyDebtAmount/1000, 0.99, na.rm = TRUE)) +
  xlab("Monthly Debt, 1K") +
  ylab("Rate") +
  scale_color_brewer(name = "Monthly Income", type = "div") +
  facet_wrap(~Year)

# DebtToIncomeRatio vs. BorrowerRate by Year
# Coloring by IncomeRange
ggplot(aes(x = DebtToIncomeRatio * 100, y = BorrowerRate,
           color = IncomeRange),
       data = pl_subsamp) +
  geom_point(alpha = 1/2, size = 2, position = "jitter", na.rm = TRUE) +
  xlim(0, quantile(pl_subsamp$DebtToIncomeRatio*100, 0.99, na.rm = TRUE)) +
  xlab("Debt Ratio, %") +
  ylab("Rate") +
  scale_color_brewer(name = "Monthly Income", type = "div") +
  facet_wrap(~Year)

The higher the income, the higher the debt, the less the rate. The higher the income, the less the debt ratio, the less the rate.

Building a Linear Model

m1 <- lm(I(BorrowerRate) ~ I(as.numeric(as.character(Term))), data = pl_subsamp)
m2 <- update(m1, ~ . + DebtToIncomeRatio)
m3 <- update(m2, ~ . + StatedMonthlyIncome)
m4 <- update(m3, ~ . + as.logical(IsBorrowerHomeowner))
m5 <- update(m4, ~ . + as.numeric(as.character(Year)))
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))), 
##     data = pl_subsamp)
## m2: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) + 
##     DebtToIncomeRatio, data = pl_subsamp)
## m3: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) + 
##     DebtToIncomeRatio + StatedMonthlyIncome, data = pl_subsamp)
## m4: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) + 
##     DebtToIncomeRatio + StatedMonthlyIncome + as.logical(IsBorrowerHomeowner), 
##     data = pl_subsamp)
## m5: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) + 
##     DebtToIncomeRatio + StatedMonthlyIncome + as.logical(IsBorrowerHomeowner) + 
##     as.numeric(as.character(Year)), data = pl_subsamp)
## 
## =========================================================================================
##                                        m1         m2         m3         m4         m5    
## -----------------------------------------------------------------------------------------
## (Intercept)                         0.185***   0.180***   0.188***   0.193***   1.246    
##                                    (0.003)    (0.003)    (0.003)    (0.003)    (0.639)   
## I(as.numeric(as.character(Term)))   0.000*     0.000*     0.000**    0.000***   0.000*** 
##                                    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)   
## DebtToIncomeRatio                              0.008***   0.006***   0.006***   0.006*** 
##                                               (0.001)    (0.001)    (0.001)    (0.001)   
## StatedMonthlyIncome                                      -0.000***  -0.000***  -0.000*** 
##                                                          (0.000)    (0.000)    (0.000)   
## as.logical(IsBorrowerHomeowner)                                     -0.017***  -0.016*** 
##                                                                     (0.002)    (0.002)   
## as.numeric(as.character(Year))                                                 -0.001    
##                                                                                (0.000)   
## -----------------------------------------------------------------------------------------
## R-squared                               0.001      0.004      0.018      0.030      0.031
## adj. R-squared                          0.000      0.004      0.018      0.030      0.030
## sigma                                   0.074      0.074      0.073      0.073      0.073
## F                                       5.333     17.911     57.352     72.752     58.754
## p                                       0.021      0.000      0.000      0.000      0.000
## Log-likelihood                      11808.097  11046.887  11114.280  11172.349  11173.705
## Deviance                               55.193     50.302     49.577     48.960     48.946
## AIC                                -23610.194 -22085.773 -22218.561 -22332.699 -22333.411
## BIC                                -23588.563 -22057.229 -22182.881 -22289.883 -22283.459
## N                                   10000       9283       9283       9283       9283    
## =========================================================================================
# Model m3
standardLoan <- data.frame(Term = 36, DebtToIncomeRatio = 0.33, 
                           StatedMonthlyIncome = 5000)
riskyLoan <- data.frame(Term = 60, DebtToIncomeRatio = 0.8, 
                           StatedMonthlyIncome = 8000)
lowIncomeLoan <- data.frame(Term = 12, DebtToIncomeRatio = 0.2, 
                           StatedMonthlyIncome = 2000)
extremeLoan <- data.frame(Term = 60, DebtToIncomeRatio = 10, 
                           StatedMonthlyIncome = 5000)

modelSLEstimate <- predict(m3, newdata = standardLoan,
                        interval="prediction", level = .95)
print(modelSLEstimate)
##         fit        lwr       upr
## 1 0.1898526 0.04656084 0.3331444
modelRLEstimate <- predict(m3, newdata = riskyLoan,
                        interval="prediction", level = .95)
print(modelRLEstimate)
##         fit        lwr       upr
## 1 0.1930132 0.04968813 0.3363383
modelLILEstimate <- predict(m3, newdata = lowIncomeLoan,
                        interval="prediction", level = .95)
print(modelLILEstimate)
##         fit        lwr       upr
## 1 0.1887025 0.04535051 0.3320545
modelELEstimate <- predict(m3, newdata = extremeLoan,
                        interval="prediction", level = .95)
print(modelELEstimate)
##         fit      lwr       upr
## 1 0.2521724 0.106355 0.3979898
# Model m5
standardLoan <- data.frame(Term = 36, DebtToIncomeRatio = 0.33, 
                           StatedMonthlyIncome = 5000, IsBorrowerHomeowner = TRUE,
                           Year = 2013)
riskyLoan <- data.frame(Term = 60, DebtToIncomeRatio = 0.8, 
                        StatedMonthlyIncome = 8000, IsBorrowerHomeowner = FALSE,
                        Year = 2009)
lowIncomeLoan <- data.frame(Term = 12, DebtToIncomeRatio = 0.2, 
                           StatedMonthlyIncome = 2000, IsBorrowerHomeowner = FALSE,
                           Year = 2013)
extremeLoan <- data.frame(Term = 60, DebtToIncomeRatio = 10, 
                           StatedMonthlyIncome = 5000, IsBorrowerHomeowner = FALSE,
                          Year = 2009)

modelSLEstimate <- predict(m5, newdata = standardLoan,
                        interval="prediction", level = .95)
print(modelSLEstimate)
##         fit        lwr       upr
## 1 0.1803261 0.03791799 0.3227342
modelRLEstimate <- predict(m5, newdata = riskyLoan,
                        interval="prediction", level = .95)
print(modelRLEstimate)
##         fit       lwr       upr
## 1 0.2051351 0.0626807 0.3475895
modelLILEstimate <- predict(m5, newdata = lowIncomeLoan,
                        interval="prediction", level = .95)
print(modelLILEstimate)
##         fit        lwr       upr
## 1 0.1926385 0.05015241 0.3351246
modelELEstimate <- predict(m5, newdata = extremeLoan,
                        interval="prediction", level = .95)
print(modelELEstimate)
##         fit       lwr       upr
## 1 0.2679155 0.1229845 0.4128464

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Given that rate change over time, we’ve plotted all visualisations by year. We attached a particular focus to StatedMonthlyIncome as it is an important variable when asking for a loan. ProsperScore was the only variable drawing a clear frontier between different rates. All other variables showed fuzzy visualisations and we couldn’t draw any conclusion. However,

Were there any interesting or surprising interactions between features?

The insteresting interaction came from defaulted loans. Indeed, back in the bivariate analysis, we found that loans that resulted in a default were among the loans with lower rates. Visualising by year showed that it’s not true. Most probably many defaults happened when rates were low. Central banks do lower rates in crisis periods in order to boost the ecenomy (low rates, more loans, more consumption and more investing, etc ..).

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I’ve build 2 linear models: - 1st is based on Term, DebtToIncomeRatio, StatedMonthlyAmount - 2nd is based on IsBorrowerHomeowner, Year on top of model 1 Tested on 3 different profiles, they both give close values, ordered as expected: higher rate for risky, less for low income and less for standard profiles. The values of the predicted rate are close (all around 18%/19%) which is around the mean for those profiles. For a confidence interval of 95%, lower and upper values were also close. So this model is good to give an idea of what would be the average rate for a profile. For extreme cases, our extreme profile have a rate 6% to 8% higher than other profiles, and an lower/upper values of 12%/41% for a confidence interval of 95%, wihch is good enough.


Final Plots and Summary

Plot One

Description One

When we plotted in the second section Rate vs. LoanStatus in box plots, we observed that the defaulted loan are having the smaller rates. Those risky loan were granted with confortable rates, which is non sense. Is the bank making a mistake in evaluating its borrowers? Our hypothesis was that most defaulted loans were granted in a period of economic trouble, where rates were systemically low. The plot shows that : - most defaults happended for loan issued between 2006-2008 - most loans were granted between 2012-1013 - mean rates for that periods were close These elements, give to some extent, some credit to our hypothesis.

Plot Two

Description Two

DebtToIncomeRatio seem to be one important feature to look at before asking for a loan. Indeed, the less debt you have, the more leverage you have, and the less risk you carry, which end up with you paying low rates than your counterparts. However, when we look at this plot, we cannot draw any relationship between the debt ratio and the rate. There seems to be a small positive correlation (2011 is much bigger), but not enough to assert with conviction. On the other side, Faceting by ProsperScore, which is the risk property of the loan, we can see clearly that less risky loans pay smaller rates. Obviously, the debt ratio is not enough to measure the riskiness of a loan.

Plot Three

Description Three

High-incomes may not have big debt ratios, but they may pay as much as low-incomes. On the plot, we can see high-incomes on the bottom-left triangle and low incomes on the top-right triangle (especially between 2012 and 2014).


Reflection

So far, we’ve been trying to explain the borrower rate by some of the features provided in this data set. Armed with intuition and common sense, we thought that we’ll find some patterns and trends easily from the bi-variate plots. However, it turned out to be much tougher. Indeed, all patterns have been dug up in the multi-variate plots section. With the selected features, we’ve build a linear model that fit pretty well some profiles. But given the variance of rates for similar profiles, we’ll need to include more features in order to build a better model. This means that we’ll need to select other properties of the loans.